ALTERNATING DIRECTION ALGORITHMS FOR ^-PROBLEMS 
IN COMPRESSIVE SENSING 
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Abstract. In this paper, wc propose and study the use of alternating direction algorithms for several ^i-norm minimization 
problems arising from sparse solution recovery in compressive sensing, including the basis pursuit problem, the basis-pursuit 
denoising problems of both unconstrained and constrained forms, as well as others. We present and investigate two classes of 
algorithms derived from either the primal or the dual forms of the £i-problems. The construction of the algorithms consists of two 
main steps: (1) to reformulate an ^i-problem into one having partially separable objective functions by adding new variables 
and constraints; and (2) to apply an exact or inexact alternating direction method to the resulting problem. The derived 
alternating direction algorithms can be regarded as first-order primal-dual algorithms because both primal and dual variables 
are updated at each and every iteration. Convergence properties of these algorithms are established or restated when they 
already exist. Extensive numerical results in comparison with several state-of-the-art algorithms are given to demonstrate that 
the proposed algorithms are efficient, stable and robust. Moreover, wc present numerical results to emphasize two practically 
important but perhaps overlooked points. One point is that algorithm speed should always be evaluated relative to appropriate 
solution accuracy; another is that whenever erroneous measurements possibly exist, the £i-norm fidelity should be the fidelity 
of choice in compressive sensing. 
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1. Introduction. In the last few years, algorithms for finding sparse solutions of underdetermined 
linear systems have been intensively studied, largely because solving such problems constitutes a critical 
step in an emerging methodology in digital signal processing — compressive sensing or sampling (CS). In 
CS, a digital signal is encoded as inner products between the signal and a set of random (or random-like) 
vectors where the number of such inner products, or linear measurements, can be significantly fewer than 
the length of the signal. On the other hand, the decoding process requires finding a sparse solution of an 
underdetermined linear system. What makes such a scheme work is sparsity; i.e., the original signal must 
have a sparse or nearly sparse representation under some known basis. Throughout this paper we will allow 
all involved quantities (signals, acquired data and encoding matrices) to be complex. Let x £ C n be an 
original signal that we wish to capture. Without loss of generality, we assume that x is sparse under the 
canonical basis, i.e., the number of nonzero components in x, denoted by ||x||o, is far fewer than its length. 
Instead of sampling x directly, in CS one first obtains a set of linear measurements 

b = AxeC m , (1.1) 

where A € C mxn (m < n) is an encoding matrix. The original signal x is then reconstructed from the 
underdetermined linear system Ax = b via certain reconstruction technique. Basic CS theory presented in 
HH] states that it is extremely probable to reconstruct x accurately or even exactly from b provided 
that x is sufficiently sparse (or nearly sparse) relative to the number of measurements, and the encoding 
matrix A possesses certain desirable attributes. 

In the rest of this section, we briefly review the essential ingredients of CS decoding process and some 
existing methods for the relevant optimization problems, summarize our main contributions in this paper, 
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and describe the notation and organization of the paper. 



1.1. Signal Decoding in CS. To make CS successful, two ingredients must be addressed carefully. 
First, a sensing matrix A must be designed so that the compressed measurement b = Ax contains enough 
information for a successful recovery of x. Second, an efficient, stable and robust reconstruction algorithm 
must be available for recovering x from A and b. In the present paper, we will only concentrate on the second 
aspect. 

In order to recover the sparse signal x from the underdetermined system (jl.lj) . one could naturally 
consider seeking among all solutions of Ql.ip the sparsest one, i.e., solving 

min{||a;||o :Ax = b} (1.2) 

where ||a;||o is the number of nonzeros in x. Indeed, with overwhelming probability decoder (|1.2|) can recover 
sparse signals exactly from a very limited number of random measurements (see e.g., [I]). Unfortunately, 
this £o-P r oblem is combinatorial and generally computationally intractable. A fundamental decoding model 
in CS is the so-called basis pursuit (BP) problem [12] : 

min{||x||i : Ax = b}. (1.3) 



Minimizing the ^i-norm in (jT73J) plays a central role in promoting solution sparsity. In fact, problem fL3J) 
shares common solutions with (|1.2[) under some favorable conditions (see, for example, [16j). When b contains 
noise, or when x is not exactly sparse but only compressible, as are the cases in most practical applications, 
certain relaxation to the equality constraint in (| 1 .3|) is desirable. In such situations, common relaxations to 
(|1.3p include the constrained basis pursuit denoising (BP^) problem [T2"] : 

mmiWxlU-.WAx-bhKS}, (1.4) 

and its variants including the unconstrained basis pursuit denoising (QP M ) problem 

mm ||a;||i + — ||Ac - 6||i, (1.5) 



where <5, /i > wee parameters. From optimization theory, it ie well known that problems (|1.4[) and (j 1 . 5[) are 
equivalent in the sense that solving one will determine a parameter value in the other so that the two share 
the same solution. As 5 and ji approach zero, both BP5 and QP M converge to (|1.3| . In this paper, we also 
consider the use of an £ 1 jt^ model of the form 

min ||x||i + -\Ax - (1.6) 

xec n v 

whenever b might contain erroneous measurements. It is well-known that unlike (| 1 . 5[) where squared ^-norm 
fidelity is used, the £i-norm fidelity term makes (|1.6p an exact penalty method in the sense that it reduces 
to (|1.3p when v > is less than some threshold. 

It is worth noting that problems (|1.3p . (|1.4p . (|1.5p and (|1.6p all have their "nonnegative counterparts" 
where the signal x is real and nonnegative. These nonnegative counterparts will be briefly considered later. 
Finally, we mention that aside from ^-related decoders, there exist alternative decoding techniques such as 
greedy algorithms (e.g., |46| ) which, however, are not a subject of concern in this paper. 
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1.2. Some existing methods. In the last few years, quite a number of algorithms have been proposed 
and studied for solving the aforementioned ^i-problems arising in CS. Although these problems are convex 
programs with relatively simple structures (e.g., the basis pursuit problem is a linear program when x is real), 
they do demand dedicated algorithms because standard methods, such as interior-point algorithms for linear 
and quadratic programming, arc simply too inefficient on them. This is the consequence of several factors, 
most prominently the fact that the data matrix A is totally dense while the solution is sparse. Clearly, 
the existing standard algorithms were not designed to handle such a feature. Another noteworthy structure 
is that encoding matrices in CS are often formed by randomly taking a subset of rows from orthonormal 
transform matrices, such as DCT (discrete cosine transform), DFT (discrete Fourier transform) or DWHT 
(discrete Walsh-Hadamard transform) matrices. Such encoding matrices do not require storage and enable 
fast matrix-vector multiplications. As a result, first-order algorithms that are able to take advantage of such 
a special feature lead to better performance and are highly desirable. 

One of the earliest first-order methods for solving (|I.5|) is the gradient projection method suggested 
in [22| . where the authors reformulated (|1.5|) as a box-constrained quadratic program and implemented 
a gradient projection method with line search. To date, the most widely studied first-order method for 
solving (|1.5| is the iterative shrinkage/thresholding (1ST) method, which is first proposed in [211 Ell HI] 
for wavelet-based image deconvolution and then independently discovered and analyzed by many others 
HH |43l |H [13]. In [29[ [30], Hale, Yin and Zhang derived the 1ST algorithm from an operator splitting 
framework and combined it with a continuation strategy. The resulting algorithm, which is named fixed-point 
continuation (FPC), is also accelerated via a non-monotone line search with Barzilai-Borwein steplength [2]. 
A similar sparse reconstruction algorithm called SpaRSA was also studied by Wright, Nowak and Figueircdo 
in [51]. Recently, Beck and Teboulle proposed a fast 1ST algorithm (FISTA) in 3], which attains the 
same optimal convergence in function values as Nesterov's multi-step gradient method [36] for minimizing 
composite functions. Lately, Yun and Toh also studied a block coordinate gradient descent (CGD) method 
in [56] for solving (jl.5|) . 

There exist also algorithms for solving constrained £i-problems (| 1 . 3[) and (|1.4[) . The Bregman iteration 
|38| was applied to the basis pursuit problem in [53] . In the same paper, a linearized Bregman method was 
also suggested and analyzed subsequently in [5] El [54] . In [23] , Friedlandcr and Van den Berg proposed a 
spectral projection gradient method (SPGL1), where l|1.4j) is solved by a root-finding framework applied to 
a sequence of LASSO problems [45]. Moreover, based on a smoothing technique studied in [35], a fast and 
accurate first-order algorithm called NESTA was proposed in 0] for solving (|1.4p . 

In Section [H we present extensive comparison results with several state-of-the-art algorithms including 
FPC, SpaRSA, FISTA and CGD for solving (TO]), and SPGL1 and NESTA for solving (TO]) and (fOj) . 

1.3. Contributions. After years of intensive research on I i-problem solving, it would appear that most 
relevant algorithmic ideas have been either tried or, in many cases, re-discovered. Yet interestingly, until the 
writing of the present paper, the classic idea of alternating direction method (ADM) had not, to the best of 
our knowledge, been seriously investigated. 

The main contributions of this paper are to introduce ADM technique to the area of solving £i-problcms 
in CS and other applications, and to demonstrate its usefulness as a versatile and powerful algorithmic 
tool. Indeed, based on ADM technique we have derived first-order primal-dual algorithms for (|1.3|) . ()1.4D . 
(|1.5|) and (|1.6|) . Furthermore, the versatility of ADM approach has allowed us to develop a Matlab package 
called YALL1 [57] that can solve eight different ^-minimization models, i.e., (|1.3|) . (|1.4j) . (|1.5|) . (|1.6|) and their 
nonncgative counterparts, where local weights are also permitted in the ^i-norm. 
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In this paper, we present extensive computational results to document the numerical performance of 
the proposed algorithms in comparison to several state-of-the-art algorithms for solving ^-problems under 
various situations. As by-products, we also address a couple of issues of practical importance; i.e., choices 
of optimization models and proper evaluation of algorithm speed. 

1.4. Notation. We let || • || be the ^2-norm and Va{ - ) be a projection operator onto a convex set 
n under the ^-norm. Superscripts "T" and "*" denote, respectively, the transpose and the conjugate 
transpose operators for real and complex quantities. We let Re(-) and | • | be, respectively, the real part 
and the magnitude of a complex quantity, which are applied component-wise to complex vectors. Further 
notation will be introduced wherever it occurs. 

1.5. Organization. This paper is organized as follows. In Section 2, we first review the basic idea 
of the classic ADM technique and then derive alternating direction algorithms for solving (| 1 . 3[) . (|1.4[) and 
(|1.5|) . We also establish convergence of the primal-based algorithms, while that of the dual-based algorithms 
follows from classic results in the literature. In Section 3, we present numerical results to compare the 
behavior of model (|1.6p to that of models (|1.4p and (|1.5|) under various scenarios of noise. In Section 4, we 
first re-emphasize the sometimes overlooked common sense on appropriate evaluations of algorithm speed, 
and then present extensive comparison results with several state-of-the-art algorithms to demonstrate the 
performance of the proposed algorithms. Finally, we conclude the paper in Section 5 and discuss several 
extensions of ADM approach to other ^i-like problems. 

2. ADM-based first-order primal-dual algorithms. In this section, based on the classic ADM 
technique, we propose first-order primal-dual algorithms that update both primal and dual variables at each 
iteration for the solution of ^-problems. We start with a brief review on a general framework of ADM. 

2.1. General framework of ADM. Let f(x) : R m — > R and g(y) : R n — > R be convex functions, 
A G R pxr ™, B G R px " and b G R p . We consider the structured optimization problem 

wm{f(x)+g(y) :Ax + By = b}, (2.1) 

where variables x and y are separate in the objective, and coupled only in the constraint. The augmented 
Lagrangian function of this problem is given by 

C A (x,y, A) = f(x) + g(y) - A T (Ax + By - b) + ^\\Ax + By - b\\ 2 , (2.2) 

where A G W is the Lagrangian multiplier and (3 > is a penalty parameter. The classic augmented 
Lagrangian method [HUES] iterates as follows: given A fc G R p , 

(x k+ \y k+1 ) <- axgram X}V JH A (x,y, A fc ), 
\k+i <_ X k - 1 (3(Ax k+1 + By k+1 - b), 

where 76 (0, 2) guarantees convergence, as long as the subproblem is solved to an increasingly high accuracy 
at every iteration [41j . However, an accurate, joint minimization with respect to (x,y) can become costly 
without taking advantage of the separable form of the objective function f(x) + g(y). In contrast, ADM 
utilizes the separability structure in (|2.ip and replaces the joint minimization by two simpler subproblems. 
Specifically, ADM minimizes C A (x, y, A) with respect to x and y separately via a Gauss-Scidel type iteration. 
After just one sweep of alternating minimization with respect to x and y, the multiplier A is updated 
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immediately. In short, given (y k ,X k ), ADM iterates as follows 

x k+1 <- argmin^ C A (x, y k , X k ), 

y k+1 «- argmm v £ A (x k+1 ,y,\ k ), (2.4) 
A fe+i <_ X k _ 1 f 3 (Ax k+1 + By k+1 - b). 

The basic idea of ADM goes back to the work of Glowinski and Marocco [28] and Gabay and Mercier 
|25j . Let #i(-) and $2( # ) be convex functionals, and A be a continuous linear operator. The authors of [25] 
considered minimizing an energy function of the form 

min0i(u) + 6 2 {Au). 

u 

By introducing an auxiliary variable v, the above problem was equivalently transformed to 

min{0i(u) + 9 2 {v) : Au - v = 0} , 

which has the form of (|2.1[) and to which the ADM approach was applied. Subsequently, ADM was studied 
extensively in optimization and variational analysis. In [27], ADM is interpreted as the Douglas- Rachford 
splitting method [T7] applied to a dual problem. The equivalence between ADM and a proximal point 
method is shown in [18| . ADM is also studied in convex programming |24| and variational inequalities 
[47] [32] . Moreover, ADM has been extended to allowing inexact sub-problem minimization [TBI I3T]. 

In (|2.4p . a steplength 7 > is attached to the update of A. Under certain technical assumptions, 
convergence of ADM with a steplength 7 G (0, (vo + l)/2) was established in (23 [27J in the context 
of variational inequality. The shrinkage in the permitted range from (0, 2) in the augmented Lagrangian 
method to (0, (v5 + l)/2) in ADM is related to relaxing the exact minimization of Ca(x, y, X k ) with respect 
to (x, y) to merely one round of alternating minimization. Recently, ADM has been applied to total variation 
based image restoration and reconstruction in [501 E2- I n the following, we apply ADM technique to (|1.4p 
and (|1.5p . while the application to (jl.3| and (jl.6| will be a by-product. 



2.2. Applying ADM to primal problems. In this subsection, we apply ADM to primal ^-problems 
(|1.4p and (|1.5p . First, we introduce auxiliary variables to reformulate these problems into the form of (|2.1[) . 
Then, we apply alternating minimization to the corresponding augmented Lagrangian functions, cither 
exactly or approximately, to obtain ADM-like algorithms. 

With an auxiliary variable r € C Tn , problem (|1.5p is clearly equivalent to 



min <f P (r,x)±\\x\\ l + — \\r\\ 2 : Ax + r = b , (2.5) 
that has an augmented Lagrangian subproblem of the form 

min j ||x||i + -!-||r|| 2 - Re(y*(Ax + r - b)) + ~\\Ax + r - b\\ 2 \ , (2.6) 

16C™, rSC m 2/J, 2 J 

where y G C m is a multiplier and (3 > is a penalty parameter. Given (x k ,y k ), we obtain (r +1 , x +1 , y k+1 ) 
by applying alternating minimization to (|2.6p . First, it is easy to show that, for x — x k and y — y k fixed, 
the minimizer of (|2.6p with respect to r is given by 

r k+1 = {V k /P - {Ax k - b)) . (2.7) 
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Second, for r — r k+1 and y = y k fixed, simple manipulation shows that the minimization of (|2.6p with 
respect to x is equivalent to 

min || a .|| 1 + ^|| J /b; + r fc+1 - b - y k / f3\\ 2 , (2.8) 

which itself is in the form of Q1.5p . However, instead of solving (|2.8p exactly, we approximate it by 

min \\ x \\ 1+ f]f(g k y( x - x k ) + L\\ x ^ x kA f (2.9) 
i£C™ y It J 

where r > is a proximal parameter and 

g k 4 A* (A X k + r k+1 - b - y k / f3) (2.10) 

is the gradient of the quadratic term in (|2.8[) , (3 not included, at x = x k . The solution of (|2.9p is given 
explicitly by 

x k+1 = Shrink fx k - rg k , 4 max jV' - rg fc | - ^,o| o sign(a: fc - rg k ), (2.11) 

where "o" represents component-wise multiplication. The operation defined in (|2. 1 1[) is well-known as the 
one-dimensional shrinkage (or soft thresholding). Finally, we update the multiplier y by 

y k+1 = y k - 1 f3{Ax k+1 + r k+1 - b), (2.12) 



^-(y k //3-(Ax k -b)) 



where 7 > is a constant. In short, ADM applied to (| 1 . 5[) produces the iteration: 

r k+l _ H$ 

x k+1 = Shrink [x k - rg k , -|) , (2.13) 
y k+i = y k _ ^p(Ax k+1 + r k+1 - b). 

We note that (|2.13p is an inexact ADM because the x-subproblem is solved approximately. The convergence 
of (|2.13[) is not covered by the analysis given in [TB] where each ADM subproblem is required to be solved 
more and more accurately as the algorithm proceeds. On the other hand, the analysis in [31] does cover the 
convergence of (|2.13p but only for the case 7 = 1. A more general convergence result for (|2.13[) is established 
below that allows 7 > 1. 

Theorem 2.1. Let r, 7 > satisfy rA ma x + 7 < 2, where A max denotes the maximum eigenvalue of 
A* A. For any fixed [3 > 0, the sequence {(r k ,x k ,y k )} generated by (|2 . 13[> from any starting point (x°,y°) 
converges to (r,x,y), where (f,x) is a solution of (|2.5p . 

Proof. The proof is given in the Appendix. □ 

A similar alternating minimization idea can also be applied to problem (|1.4|) . which is equivalent to 

min : Ar + r = 6,j|rj| <<5}, (2.14) 

iEC™, rGC m 

and has an augmented Lagrangian subproblem of the form 

min {\\ X \\i-y*{Ax + r-b) + -\\Ax + r-b\\ 2 : ||r|| < s\ . (2.15) 

Similar to the derivation of (|2.13[) . applying inexact alternating minimization to (|2. 15[) yields the following 
iteration scheme: 

r k+1 =V Bs (y k /[3-(Ax k -b)), 

x fe+1 = Shrink^ - rg k , t/(3), (2.16) 



!) 



k+l = y k _ 7/ 3(^ X fc+l + _ 
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where g k is as defined in (|2.10[) . and Vb 6 is the projection onto the set Bg = {£ G C m : ||£|| < 5} (in 

Euclidean norm) . This algorithm also has a similar convergence result as (|2 . 1 3[) . 

Theorem 2.2. Let r, 7 > satisfy rA max + 7 < 2, where A max denotes the maximum eigenvalue of 

A* A. For any fixed (3 > 0, the sequence {(r k ,x k ,y k )} generated by (|2.16[) from any starting point (x°,y°) 

converges to (f, x,y), where (r,x) solves ()2.14j) . 

Proof. The proof of Theorem 12.21 is similar to that of Theorem 12.11 and thus is omitted. □ 
Remark 1. We point out that, when /j, = S = 0, both (|2.13p and ()2.16p reduce to 

j x k+1 = Shrink (x k - rA*{Ax k -b- y k /f3),r/f3) , 

j y k + l = y k _ 7 j3(Ax k+1 ~ b), 

which is an algorithm for solving the basis pursuit problem (|1.3[) . Using similar techniques as in the proof for 
Theorem \2.1[ we can establish the convergence of (|2.17p to a solution of (|1.3p . The only difference between 
(|2.17p and the linearized Bregman method proposed in [53] lies in the updating of multiplier. The advantage 
of (|2.17p is that it solves (jl.3p . while the linearized Bregman method solves a penalty approximation of (jl.3p 
(see e.g., fSJI). 

Since we applied the ADM idea to the primal problems (|1.3p . (|1.4p and (|1.5p . we name the resulting 
algorithms (|2.13p . (|2.16p and (|2.17p primal-based ADMs or PADMs in short. In fact, these algorithms are 
really of primal-dual nature because both the primal and the dual variables are updated at each and every 
iteration. In addition, these are obviously first-order algorithms. 

2.3. Applying ADM to dual problems. Similarly, we can apply the ADM idea to the dual problems 
of p.4p and (|1.5p . which results to equally simple yet more efficient algorithms. First, we assume that the 
rows of A are orthonormal, i.e., A satisfies AA* = I. We will remark on how to extend the treatment to a 
non-orthonormal matrix A at the end of this section. 

Simple computation shows that the dual of (|1.5p or equivalently (|2.5p is given by 

max min ilMliH \\r\\ 2 — He(y* (Ax + r — b)) 

ygC m i£C" ,rec m 2/i 

= max {Re(b*y) - ^\\y\\ 2 + min (||x||i - Ile{y*Ax)) + — min ||r - ny\\ 2 
yec m [ z xec™ Zfi rec m 



max 

yet 



{Re(&*y)-||M| 2 :A*yeBr>}, (2.18) 

where BJ° = 6 C' 1 : ||^||oo < !}• By introducing z S C", (|2.18p is equivalently transformed to 

max \f d {y) 4 Ke(b*y) - ^\\y\\ 2 : z - A*y = 0, z G B? 3 } , (2.19) 
which has an augmented Lagrangian subproblem of the form 



mm 



Re(b*y) + %\\ 2 Re(x*(z - A*y)) + hz- A*y\\ 2 , z G B?° k (2.20) 



yeC m ,zGC" l_ 2 2 

where a; G C™ is a multiplier (in fact, the primal variable) and (3 > is a penalty parameter. Now we apply 
the ADM scheme to (|2.19p . i.e., alternatingly update the multiplier (or the primal variable) x G C™ and the 
dual variables y G C m and z G C n . First, it is easy to show that, for x = x k and y = y k , the minimizer 
of (|2.20p with respect to z is given explicitly by 

z k+1 =V Br {A*y k +x k /(3), (2.21) 
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where, as in the rest of the paper, V represent a projection (in Euclidean norm) onto a convex set denoted 
as a subscript. Second, for x = x k and z = z k+1 , the minimization of (|2.20|) with respect to y is a least 
squares problem and the corresponding normal equations are 

QjJ + (3AA*)y = (3Az k+1 - (Ax k - b). (2.22) 

Under the assumption AA* = /, the solution y k+1 of (j2.22p is given by 

y k+1 = -XT? ( Azk+1 - ( Axk - b )/P) ■ ( 2 - 23 ) 

Finally, we update x as follows 

x k+i = x k _ 7/3 ( z ^+i _ A*y k+1 ), (2.24) 

where 7 e (0, (VE + l)/2). Thus, the ADM scheme for (|2~T9|) is as follows: 

z k+1 = V Br (A*y k + x k //3), 

y k+1 = A ( Azk+1 ~ ^ ~ b VP) ' (2 - 25) 

x k+1 = x k - j(3(z k+1 - A*y k+1 ). 



Similarly, the ADM technique can also be applied to the dual of Q 1.4)1 given by 

max {b*y - 5\\y\\ : A*y e B?° }, (2.26) 

i/GC m 

and produces the iteration scheme 

z k+1 =V BT (A*y k + x k /(3), 

y k+1 = S (Az k+1 - (Ax k - b) /p, 8/(5) , (2.27) 
x k+i =x k _ 7/ 3( z fe+i _ A*y k+1 ), 

where S(v,5/j3) = v — T*b s , p (v) with "Bg/p being the Euclidian ball in C m with radius 8/j3. 

Remark 2. Under the assumption AA* = /, (|2.25j) is an exact ADM in the sense that each subproblem 
is solved exactly. From convergence results in \2b\ \ 27j , for any > and 7 £ (0, (V5 + l)/2), the sequence 
{(x k ,y k ,z k )} generated by ([2.25)1 /rom any starting point (x°,y°) converges to (x,y,z), which solves the 
primal-dual pair ([1.5)1 and (|2.19p . Similar arguments apply to (|2.27[) and t/ie primal-dual pair (|1 .4[) and 

o. 

Derived from the dual problems, we name the algorithms ([2.25)1 and ([2.27)1 dual-based ADMs or simply 
DADMs. Again we note these are in fact first-order primal-dual algorithms. 
It is easy to show that the dual of (|1.3p is given by 

max {Re(b*y) : A*y e Bf} , (2.28) 

yeC m 

which is a special case of (|2.18j) and (|2.26|) with fj, = 8 = 0. Therefore, both ([2.25)1 and (|2.27j) can be applied 
to solve (jl.3p . Specifically, when fi = 8 = 0, both ([2.25P and ([2.27P reduce to 

z k + l = V BT (A*y k + x k /!3), 

y k+i = Az k+i _ ^ Ax k _ 6 ) (2.29) 
^fc+i =x k _ 1 p( yZ k+i _ A*y fc+1 ). 
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We note that the last equality in (|2.18p holds if and only if r = fiy. Therefore, the primal-dual residues 
and the duality gap between (|2.5|) and (|2. 19[) can be defined by 

r p = Ax + r — b = Ax + fj,y — b, 

r d 4 A*y - z, (2.30) 
A = U(y) - f P (x, r) = Re(b*y) - ^\\y\\ 2 - \\x\\ v 

In computation, algorithm (|2.25|) can be terminated by 

Res 4 {||r p ||/||6||, \\r d \\/V^, A/f p (x,r)} < e (2.31) 

where e > measures residue in optimality. 

When AA* ^ I, the solution of (|2 . 22[) is costly. In this case, we take a steepest descent step in the y 
direction and obtain the following iteration scheme: 



(2.32) 



z k+l 




(A*y k + x k /f3), 


yk + 1 


= y k - 




x k+l 


— x k - 


jl3(z k+1 - A*y k+1 ), 



where, at the current point (x k , y k , z k ), g k and are given by 

g k = fiy k + Ax k - b + j3A(A*y k - z k+1 ) and a* k 



(g k )*g k 



(g k )*(fiI + (3AA*)g k - 



In our experiments, algorithm (|2.32|) converges very well for random matrices where AA* ^ I, although its 
convergence remains an issue of further research. Similar arguments apply to (|2.26jl . 

The ADM idea can also be easily applied to ^i-problems for recovering real and nonnegative signals. As 
an example, we consider model ()1.5j) plus nonnegativity constraints: 



min <{ 11x11! + -L\\Ax-b\\ 2 :x>Q\, (2.33) 

zSR™ [2/1 J 

where (A,b) can remain complex, e.g., A being a partial Fourier matrix. A similar derivation as for (|2 . 
shows that a dual problem of (|2.33|) is equivalent to 

max f Ke(b*y) - %f : z - A*y = 0, z 6 t\ , (2.34) 

where T — {z £ C' 1 : Re(z) < 1}. The only difference between (|2.34p and (|2.19|) lies in the changing of 
constraints on z from z G Bf 3 to z E T . Applying the ADM idea to (|2.34[) , i.e., alternately update the 
primal and dual variables, yields an iterative algorithm with the same updating formulae as (|2.25[) except 
the computation for z k+1 is replaced by 

z k+1 = Vjr(A*y k + x k /l3). (2.35) 



It is clear that the projection onto T is trivial. The same procedure applies to the dual problems of other 
^i-problems with nonnegativity constraints as well. Currently, with simple optional parameter settings, our 
Matlab package YALL1 can be applied to solve (|1.3p . (|1.4[) , (|1.5[) . (|1.6p and their nonnegative counterparts. 
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3. Choice of denoising models. In most practical applications, measurements are invariably contam- 
inated by some noise. To date, the most widely used denoising models in CS include (|1.5[) and its variants, 
which are based on the £2-norm fidelity. An alternative approach is to penalize Ax — b by the £i-norm, 
resulting the denoising model (|1.6[) . Between the two types of models, which type should be preferred 

in general? So far, model (| 1 . 5(1 . along with its variants, is widely used and appears to be the de facto model 
of choice. However, we provide supporting evidence to argue that perhaps in most practical situations the 
l\ I ' i\ model (|1.6[) should be preferred unless there is an excessive amount of white noise in observed data, in 
which case one can only expect to obtain low-quality solutions. On the other hand, if there is any possibility 
that observed data may contain large measurement errors or impulsive noise, model fll.Gp can potentially be 
dramatically better than (|1 .5[) (see also, e.g., [50]). 

We conducted a set of experiments comparing models (|1.4[) . (|1.5|) with (|1.6[) on random problems with 
n = 1000, m = 300 and k = 60, using the solver YALL1 [57], which implements the dual ADMs described 
in Subsection 12. 31 for eight different models, including model (|1.6p that is reformulated and solved as a basis 
pursuit problem in the form of (|1.3p . The reformulation is as follows. Clearly, problem (|1.6[) is equivalent to 

min {;/||:r||i + |?'j|i : Ax + r = b} , 

i£C" ,rSC m 

which, after a change of variables, can be rewritten as miii£ e cn+m{||i||i : Ax = &}, where 

A= {AVI \ b= / b ^x={ VX \. (3.1) 

We note that AA* = I provided that AA* = I. In our experiments, each model is solved for a sequence of 
parameter values varying from to 1. The simulation of data acquisition is as follows 

b = Ax + p = Ax + pw + Pi = bw + Pi, (3.2) 

where p,pw,Pi represents, respectively, the total, the white and the impulsive noise, and bw is the data 
containing white noise only (or noiseless whenever pw = 0). White noise is generated by the MATLAB 
command randn(n,l) multiplied by an appropriately chosen constant to attain a desired signal-to-noise 
ratio, while impulsive noise values are set to be ±1 at random positions of b that is always scaled so that 
IHloo = 1- As is usually done, we measure the signal-to-noise ratio (SNR) of both bw and b in terms of 
decibel (dB). In this paper, we use the following definition of SNR of b: 

SNRm = 201 „ gIO (£^™), 

where E(6) represents the mean value of b. The SNR of bw is defined similarly. Impulsive noise is measured 
by percentage, e.g., 5% impulsive noise means 5% of elements in b are erroneous. For a computed solution 
x from a data vector b defined in (|3.2p . the relative error in x is defined in terms of percentage: 

RelErr(a;) = x 100%. (3.3) 

INI 

Figure |3~T1 presents results for pw = and pi changes from 1% to 10%. From the results on the top 
row of Figure 13. R it is quite clear that model (|1.6p is able to recover the exact solution i to a high accuracy 
for a range of v values (although the range for high quality recovery shrinks when the corruption becomes 
increasingly severe), while model (|1.4p is not, even though in all cases it reduces relative errors by about 5% 
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when v is close to 1 (we tried even larger v values but the achievable improvement soon saturates at that 
level). From the results on the bottom row of Figure [3~Tl model (|1.5[) behaves similarly as (jl.4[) , i.e., the 
quality of recovered signals is much worse than those from (|1.6[) . Therefore, there should be no doubt that 
model (|1.6[) is superior to (|1.4[) and Q1.5P provided that v is chosen properly and should be the model of 
choice whenever there is a possibility of erroneous measurements in data no matter how small the percentage 
might be. 



Impulsive noise only: 1% 5% 10% 




(i values u. values n values 



Fig. 3.1. Recovered results from data corrupted by impulsive noise (from left to right in both rows: 1%, 5% and 10%). Top 
row: recovered solutions from BP U (5 <— v in HI. 41 1 ) and problem 111.61 1 (the x-axes represent v values in both models); 

Bottom row: recovered solutions from QP^ (the x-axes represent ji values in the model). In all plots, the y-axes represent 
relative errors of recovered solutions to the true sparse signal. 

Which model should be used when data contain both white and impulsive noise? Let us examine results 
given in Figure EOl where (|1.6|) is compared with (|1.4[) with data satisfying SNR(&vk) = 40dB (first row) 
and 20dB (second row). In both cases pj takes values of 1%, 5% and 10%. Similar to the noiseless case (free 
of white noise) , evidence strongly suggests that (|1.6|) should be the model of choice whenever there might be 
erroneous measurements or impulsive noise in data even in the presence of white noise. We did not present 
the results of (|1.5p since they are similar to those of p. 41) . 

Is model (|1.6[) still appropriate when there is no impulsive noise? Figure 13.31 contains results obtained 
from data with white noise only of SNR(6) = SNR(&w') = 30dB, 20dB and lOdB, respectively. Qualitatively 
and loosely speaking, these three types of data can be characterized as good, fair and poor, respectively. 
As can be seen from the top left plot, on good data (|1.4p offers no improvement whatsoever to the basis 
pursuit model (y — 0) as v decreases. On the contrary, it starts to degrade the quality of solution once 
v > 0.25. On the other hand, model (| 1 . 6[) essentially does no harm until v > 0.7. From the top middle plot, 
it can be seen that on fair data both models start to degrade the quality of solution after v > 0.7, while 
the rate of degradation is faster for model (|1.6p . Only in the case of poor data (the top right plot), model 
(jl.4p always offers better solution quality than that of model (|1.6p . However, for poor data the solution 
quality is always poor for v £ [0,1] (and beyond as well). At v = 1, the relative error of model (|1.4p is about 




Fig. 3.2. Recovered results from data corrupted by both white and impulsive noise. SNR of byy *s J^OdB (top row) and 
20dB (bottom row); In each row, the ratios of impulsive noise corruption are 1%, 5% and 10% from left to right; x-axes: model 
prameters in BP V (8 «— v in 11,41 1 ) and model II. 6H ; y-axes: relative errors of recovered solutions to the true signal. 



38%, representing a less than 5% improvement over the relative error 42% at v = 0.05, while the best error 
attained from model (| 1 . 6[) is about 40%. The results of (|1.5p are generally similar to those of (|1.4[) provided 
that model parameters are selected properly, as is shown on the bottom row of Figure 13.31 

The sum of the computational evidence suggests the following three guidelines, at least for random 
problems of the type tested here: 

1. Whenever data may contain large measurement errors, erroneous observations or in general impulsive 
noise, the £i-norm fidelity used by model (| 1 . 6|) should naturally be preferred over the ^2-norm fidelity 
used by model (|1.4p and its variants. 

2. Without impulsive noise, the I i-norm fidelity does no harm to solution quality, as long as data do 
not contain a large amount of white noise and v remains reasonably small. 

3. When data are contaminated by a large amount of white noise, the ^2-norm fidelity should be 
preferred. In this case, however, high-quality recovery should not be expected regardless what 
model is used. 

4. Numerical results. In this section, we compare the proposed ADMs for t\ problems, which will 
be referred as PADM and DADM, respectively, with several state-of-the-art algorithms to demonstrate their 
practical performance. The rest of this section is organized as follows. In Section ^. 11 we present experimental 
results to emphasize a simple yet often overlooked point that algorithm speed should be evaluated relative to 
solution accuracy. In Section [4.2[ we describe experimental settings, including parameter choices, stopping 
rules and generation of problem data under MATLAB environment. In Section 14.31 we compare PADM and 
DADM with FPC-BB [29j[30], a fixed-point continuation method with non-monotone line search based the 
Barzilai and Borwein (BB) steplcngth [2], SpaRSA [5T] — a sparse reconstruction algorithm for more general 
regularizers, FISTA [3] — a fast 1ST algorithm that attains an optimal convergence rate in function values, 
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Fig. 3.3. Recovered results from data contaminated by white noise only. In both rows, SNR of byy : 30dB, 20dB and lOdB 
from left to right. Top row: results of BP V (5 <— v in 11.41 ) and £i/£i model 11.611 ; Bottom row: results of QP/j,; In all plots, 
the x-axes represent model parameters and the y-axes represent relative errors of recovered solutions to the true signal. 



and CGD [56j — a block coordinate gradient descent method for minimizing ^i-regularized convex smooth 
function. In Section POl we compare PADM and DADM with SPGL1 [53] — a spectral projected gradient 
algorithm for p.4[) . and NESTA [4] — a fast and accurate fast first-order algorithm based on Nesterov's 
smoothing technique [35]. All experiments were performed under Windows Vista Premium and MATLAB v7.8 
(R2009a) running on a Lenovo laptop with an Intel Core 2 Duo CPU at 1.8 GHz and 2 GB of memory. 

4.1. Relative error versus optimality. In algorithm assessment, the speed of an algorithm is often 
taken as an important criterion, and rightfully so. However, speed is a relative concept and ought to be 
used within context, which unfortunately has not always been the case. In track and field, human speed is 
measured relative to the distance of a race. The fastest short distance runner may very well be a rather slow 
long distance runner and vice versa. Similarly, algorithm speed should be measured relative to accuracy. 
Using a single accuracy to judge algorithm speed could be as misleading as comparing the speeds of short 
and long distance runners. 

Clearly, the definition for a good accuracy can vary with situation and application. Consequently, 
algorithm speed should be judged within concrete context. A relevant question here is what kind of accuracy 
is reasonable for solving compressive sensing problems, especially when data are noisy as is the case in most 
real applications. To answer this question, we solved (|1.3p with noiseless data and solved (|1.5j) with data 
contaminated by white noise of small to moderate levels. In this experiment, the measurement matrix was 
constructed by orthogonalizing and normalizing the rows of a 330 by 1000 standard random Gaussian matrix. 
The true signal x has 60 nonzcros whose positions are determined at random, and the nonzero values are 
random Gaussian. Both problems (|1.3p and (|1.5|) were solved by DADM to a relative high accuracy. The 
results on relative error defined in (|3.3|) and optimality measured in residue defined in (|2.31[) are given in 
Figure OJ 
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Noiseless data SNR(b): 40dB SNR(b): 20dB 




50 100 150 200 250 50 100 150 200 250 20 40 60 80 
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Fig. 4.1. Relative error versus optimality for noiseless and noisy data. The x-axes represent number of iterations, and 
y-axes represent relative error (RelErr) defined in (13 .3D and residue (Res) defined in d 2 . 31 I t - 

It is clear from Figure |4~T1 that solving £i-problems to high accuracy improves solution quality only when 
the observed data are free of noise. In the left plot of Figure I4TT1 where noiseless data were used in (|1.3[) . both 
relative error and optimality measured by residue decrease as DADM proceeds. For noisy data, a relatively 
low accuracy is sufficient to give the best relative error that an £i-denoising model can reach, e.g., in the 
middle plot of Figure 14.11 where low level noisy data were used in (|1.5[) , relative error does not decrease 
anymore after the residue is reduced to about 10 -2 in about 40 iterations. This phenomenon becomes more 
and more obvious for higher level noisy data, as is shown in the right plot of Figure 14.11 Based on these 
observations, we conclude that solving £i-problems to high accuracy is not necessary whenever observed data 
are contaminated by noise. We choose to emphasize this rather mundane point because too often such a 
common sense has been ignored in algorithmic studies. In our numerical comparison, whenever noisy data 
are used we will not compare how fast algorithms achieve a high accuracy, but how fast they achieve an 
appropriate accuracy that is consistent with the noise level of data. 

4.2. Experimental settings. Now we describe experimental settings, including selection of param- 
eters, stopping rules and generation of data in MATLAB. In our experiments, we mainly tested randomized 
partial transform matrices as encoding matrices for reasons given below. First, partial transform matrices 
do not need explicit storage and thus are suitable for large scale tests. Second, fast matrix-vector multipli- 
cations are available, which are the main computation load of all algorithms to be compared with. Finally, 
for this class of matrices there hold AA* = I, which is not only useful in parameter choices in PADM but 
also required in DADM for exact minimization of subproblcms. We note that condition AA* = I implies 
X mlxx (A*A) = 1. Therefore, for convergence of PADM, in our experiments we set r = 0.8, 7 = 1.199 and 
f3 — 2m/\\b\\i in (|2. 13[) and (|2.16p , which work quite well in practice, although suitably larger r and 7 seem 
work better most of the time. For DADM, we used the default settings in YALL1, i.e., 7 = 1.618 and 
j3 = |[&||i/m. As described in subscction l4.1( high accuracy is not always necessary in CS problems especially 
for noisy data. Thus, when comparing with other algorithms, we simply terminated PADM and DADM 
when relative change of two consecutive iterates becomes small, i.e., 

ii™fc+l _ „,fcn 

RelChg^ l|X X 11 <e, (4.1) 

F I 

where e > is a tolerance, although more complicated stopping rules, such as the one based on optimality 
conditions defined in arc possible. Parametric settings of FPC-BB, SpaRSA, FISTA, CGD, SPGL1 

and NESTA will be specified when we make comparison with them. In all experiments, we generated data 
b by MATLAB scripts b = A*xbar + sigma*randn(m, 1) , where A is a randomized partial Walsh-Hadamard 
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transform matrix, xbar represents a sparse signal that we wish to recover, and sigma is the standard deviation 
of additive Gaussian noise. The partial Walsh-Hadamard transform is implemented in the C language with 
a MATLAB mex-interface available to all codes compared. In all tests, we set n = 8192 and tested various 
combinations of m and k (the number of nonzero components in xbar). We initialized x to the zero vector 
in all algorithms unless otherwise specified. 

4.3. Comparison with FPC-BB, SpaRSA, FISTA and CGD. In this subsection, we present 
comparison results of PADM and DADM with FPC-BB [SUED], SpaRSA [51], FISTA [3] and CGD [56], all 
of which were developed in the last two years for solving (|1.5p . In the comparison, we used random Gaussian 
spikes as xbar, i.e., the location of nonzeros arc selected uniformly at random while the values of nonzero 
components are iid Gaussian entries. The standard deviation of additive noise is set to be 10 -3 , while model 
parameter [i in (|1.5[) is set to be either 10~ 3 or 10~ 4 , both of which are suitable values for small noise. Wc 
aimed to compare both the efficiency and robustness of the algorithms. 

Since different algorithms use different stopping criteria, it is rather difficult to compare their relative 
performance completely fairly. Therefore, we present two classes of comparison results. In the first class 
of results, we run PADM and DADM for a prescribed number of iterations, and choose proper stopping 
tolerance values for other algorithms so that their iteration numbers are roughly euqal to the prescribed 
iteration number. Then we examine how relative errors and function values decrease as each algorithm 
proceeds. In the second class of results, we terminate the ADM algorithms by (|4.ip . while the stopping rules 
used for other algorithms in comparison will be specified below. 

For the purpose of clarity, we first compare PADM and DADM with FPC-BB and SpaRSA, and then with 
FISTA and CGD. Since FPC-BB implements continuation on rcgularization parameter but not on stopping 
tolerance, we set all parameters as default except in the last step of continuation we let xtol = 10 -5 and 
gtol = 0.02, which is more stringent than the default setting xtol = 10~ 4 and gtol = 0.2 because the 
latter usually produces solutions of lower quality than that of other algorithms in comparison. For SpaRSA, 
we used its monotonic variant, set continuation steps to be 20 and terminated it when relative change in 
function values falls below 10~ 7 . Since the per-iteration cost is roughly two matrix-vector multiplications 
for all compared algorithms, it is helpful to examine the decreasing behavior of relative errors and function 
values as functions of iteration numbers. Figure W/Z\ presents the results of two different combinations of m 
and k. Each result is the average of 50 runs on randomly generated data. 

As can be seen from Figure S21 that, compared with FPC-BB and SpaRSA, PADM and DADM usually 
decrease relative errors and function values faster. Specifically, the relative error and function value curves 
of both PADM and DADM fall below those of FPC-BB and SpaRSA almost throughout the entire iteration 
process. With no more than 100 iterations, PADM and DADM reached lowest relative errors and then 
started to increase, which is a problem of the model (|1.5|) rather than the algorithm since function values 
keep decreasing. It is also clear that all algorithms attain nearly equal relative errors and function values. 

Next wc compare DADM with FISTA and CGD. Started at x°, FISTA iterates as follows 

x k+1 = Shrink (y k - rA*(Ay k -b),r/n), 

where r > is a parameter, and 

k _( x°, if = 0; whcre t f 1, iffc = 0; 

V ~ I x k + t -^{x k ~x k ^), otherwise, W ^ k ~\ ]±Vl+^ L } otherwise. 

It is shown in [3] that FISTA attains an optimal convergence rate 0(l/k 2 ) in decreasing function values, 
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Fig. 4.2. Comparison results of PADM, DADM, FPC-BB and SpaRSA on l|l,5J l. The x-axes represent number of 
iterations, and y-axes represent relative errors (plots on the left) and function values (plots on the right). The standard 
deviation of Gaussian noise is sigma = 10~ 3 . The results are average of 50 runs. 



where k is the iteration counter. By letting 4^ = 1, FISTA reduces to the well-known primary iterative-soft- 
thresholding (1ST) algorithm. Figure 14.31 shows the comparison results of DADM and FISTA, along with 
those of the primary 1ST algorithm in order to illustrate the advantage of FISTA over 1ST (after a small 
modification). Since the relative performance of PADM and DADM on problem (|1.5[) has been illustrated 
in Figure [4~2l and the dual approach seems to be slightly more efficient than the primal one, here we merely 
presented the results of DADM for simplicity. In this experiment, we set r = 1 for both FISTA and 1ST 
and initialized all algorithms at x = A*b rather than at the origin in order to make function values fall into 
a relatively small range. 

It can be seen from Figure |4~31 that FISTA converges much faster than 1ST. However, without heuristic 
techniques such as continuation and line search as incorporated in many algorithms, FISTA is generally 
slower than DADM. The slow convergence of FISTA and 1ST as compared with DADM becomes more 
pronounced when fj, becomes smaller in which case ([1.50 becomes more difficult. From Figure 14. 3[ when 
/i = 10 -3 , FISTA converges to a nearly optimal function value within no more than 200 iterations, while 
1ST consumes more than 1000 iterations. When fi is decreased from 10~ 3 to 10~ 4 , both FISTA and 1ST 
become slower, while the convergence of DADM is not affected. 

Similarly, the comparison results of DADM with CGD are given in Figure 14. 4[ along with those of 
FPC-BB in order to evaluate the relative performance of CGD. In this experiment, we used the continuation 
variant of CGD (the code CGD_cont in the MATLAB package of CGD) and set all parameters as default except 
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Fig. 4.3. Comparison results of 1ST, FISTA and DADM on HI. 51 1 . The x-axes represent number of iterations, and y-axes 
represent function values, both in logarithmic scale. The standard deviation of Gaussian noise is 10~ 3 . Left plot: fi = lO" 3 ; 
Right plot: fi = 10~ 4 . The results are average of 50 runs. 
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Fig. 4.4. Comparison results of FPC-BB, CCD and DADM on dl.51 ■ The x-axes represent number of iterations, and 
y-axes represent relative errors (plots on the left) and function values (plots on the right), both in logarithmic scale. The 
standard deviation of Gaussian noise is 10 — 3 . The results are average of 50 runs. 

setting the initial \x value to be max(0.01||A T x°|| oo , 2/i) which works better than the default setting in our 
tests when fi is small. It can be seen from the plot on the right of Figure 14.41 that CGD decreases the 
function value faster than FPC-BB. However, from the plot on the left, CGD does not necessarily decrease 
the relative error to the true signal faster than FPC-BB. In comparison, DADM converges faster in both the 
function value and relative error. 

We experimented on various combinations of (m, k) with noisy data and observed similar phenomenon. 
As is the case in Figures FIT21 and EQl the relative error produced by the ADM algorithms tends to eventually 
increase, after the initial decrease when problem (|1.5[) is solved to a high accuracy. This implies, as suggested 
in Section 14. 1\ that it is unnecessary to solve £i-problcms to a higher accuracy than what is warranted by 
the accuracy of the underlying data. Therefore, in the next set of tests we terminate PADM and DADM 
whenever (|4. 1[) is satisfied with e = 5x 10~ 4 instead of using much smaller tolerances. 

Next we compare PADM and DADM with FPC-BB and SpaRSA for various combinations of (m,k). 
Here we do not present results for FISTA and CGD because they have been found to be less competitive in 
this set of tests. As is already mentioned earlier, without heuristic continuation and line search techniques, 
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FISTA is slower than ADM algorithms. On the other hand, CGD is slower in terms of decreasing relative 
error. We set all parameters as default in FPC-BB and use the same setting as before for SpaRSA except it 
is terminated when relative change in function values falls below 10~ 4 . For each fixed pair (m,k), we take 
the average of 50 runs on random instances. Detailed results including iteration number (Iter) and relative 
error to the true sparse signal (RclErr) are given in Table |4~T1 



Table 4.1 

Comparison results on (1 1 . 5 I t (sigma = 10 — 3 , fi = 10 -4 , average of 50 runs). 
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1.10E-1 


210.2 


1.65E-1 


126.2 


2.00E-1 



As can be seen from Table 14. 1[ in all cases PADM and DADM obtained smaller relative errors in 
comparable numbers of iterations as FPC-BB and SpaRSA. This is particularly true for more difficult 
problems, e.g., for m/n = 0.1 and k/m = 0.2 where the resulting relative errors of PADM and DADM 
are considerably smaller than those of FPC-BB and SpaRSA. We also tried to terminated SpaRSA and 
FPC-BB using more stringent stopping rules. Specifically, we set xtol=10 -5 and gtol=0.02 in FPC-BB 
and terminated SpaRSA when relative change in function values falls below 10 -7 . The relative error results 
cither remain roughly the same as those presented in Table |4~TI or were just slightly better, while the iteration 
numbers required by both FPC-BB and SpaRSA were increased from around 50% to more than 100%. We 
also tested partial DCT and sparse signals of different dynamic ranges and observed similar phenomenon. 

4.4. Comparison with SPGL1 and NESTA. In this subsection, we compare PADM and DADM 
with SPGL1 and NESTA for solving (|1.4j) . The same as before, xbar is random Gaussian spikes, and the 
standard deviation sigma of additive noise is 10 -3 . Parameter 5 in (|1 .4[) was set to be the 2-norm of additive 
noise (the ideal case). As in the previous experiment, we performed two sets of tests. In the first set, we ran 
PADM and DADM to a prescribed number of iterations and terminated SPGL1 and NESTA with suitably 
chosen tolerances so that the iteration numbers consumed arc roughly identical. Specifically, we set optTol 
= 10~ 8 in SPGL1, which is a tolerance for optimality, and TolVar = 10~ 8 in NESTA, which is a tolerance 
in relative change in objective function. All other parameters are set to their default values. Figure 14.51 
presents average results of 50 random problems, where two combinations of m and k are used. The resulting 
relative error and fidelity residue (i.e., \\Ax — are plotted as functions of iterations. 

As can be seen from the left column of Figure |4~51 compared with SPGL1 and NESTA, both PADM and 
DADM attained smaller relative errors throughout most of the iteration process (with the exception at the 
very beginning). With no more than 100 iterations, both PADM and DADM reached lowest relative errors 
and then started to increase slightly. It is interesting to note that, as can be seen from the right column of 
Figure 14.51 NESTA is the fastest in terms of decreasing fidelity residue but slowest in terms of decreasing 
the relative error. In the applications of compressive sensing, we are interested in recovering the true signal, 
which means that the smaller the relative error is, the better. 
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Fig. 4.5. Comparison results of F 'ADM, DADM, SPGL1 and NESTA on 111.41 1. The x-axes represent number of iterations; 
y-axes represent relative error (plots on the left) and residue (plots on the right). The standard deviation of Gaussian noise is 
1CT 3 . The results are average of 50 runs. 



After extensive experiments under various scenarios, including different sensing matrices and sparse 
signals of different dynamic ranges, we have found that when data are noisy the proposed ADM algorithms 
decrease relative error faster than all other tested algorithms. 

In the second set of tests, we terminated PADM and DADM by (|4~Tj) with e = 5 x 1CT 4 . For SPGL1 
and NESTA, we set all parameters as default except TolVar is set to be 10~ 6 in NESTA (where the default 
value is 1CP 5 ) to obtain solutions of comparable quality. The average results on 50 random problems are 
given in Table l4~2l As mentioned before, matrix- vector multiplications are the main computational load for 
all compared algorithms. The number of matrix- vector multiplications consumed by PADM and DADM is 
two times the number of iterations. However, the number used by SPGL1 is not always proportional to 
the number of iterations. As for NESTA, its per-iteration cost is also two matrix-vector multiplications for 
partial orthonormal matrices, although this number increases to six for general matrices. Therefore, instead 
of iteration numbers, we present in Table l4~2l the number of matrix-vector multiplications, denoted by #AAt 
that includes both A*x and A'*y. 

As can be seen from Table [OJ compared with SGPL1 and NESTA, both PADM and DADM obtained 
solutions of comparable quality within smaller numbers of matrix-vector multiplications. 

4.5. Comparison with SPGL1 on the basis pursuit problem. In this subsection, we compare 
DADM with SPGL1 on the basis pursuit problem (|1.3[) , in which case b is noiseless and solving problems 
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Table 4.2 

Comparison results on 11.41 (sigma = 10 -3 , <5 = norm (noise) , average of 50 runs). 



11 = 


8192 


PADM 


DADM 


SPGL1 


NESTA 


m/n 


k/m 


#AAt 


RelErr 


#AAt 


RelErr 


#AAt 


RelErr 


#AAt 


RelErr 


0.3 


0.1 


107.9 


6.40E-3 


97.8 


6.35E-3 


120.8 


5.39E-3 


300.6 


5.83E-3 


0.3 


0.2 


118.1 


6.05E-3 


108.2 


6.26E-3 


160.1 


7.26E-3 


303.6 


8.33E-3 


0.2 


0.1 


124.9 


8.01E-3 


112.7 


7.92E-3 


149.7 


7.35E-3 


316.9 


6.85E-3 


0.2 


0.2 


122.6 


1.05E-2 


110.0 


1.05E-2 


168.2 


1.62E-2 


311.5 


6.41E-2 


0.1 


0.1 


167.3 


1.29E-2 


152.2 


1.26E-2 


171.9 


1.29E-2 


340.6 


1.73E-2 


0.1 


0.2 


161.2 


9.86E-2 


145.0 


1.06E-1 


184.0 


1.49E-1 


326.4 


2.96E-1 



Table 4.3 

Comparison results on 11.31 (b is noiseless; stopping rule: e = 10 -6 in 14.111 ; average of 50 runs). 



11 = 


8192 


DADM 


SPGL1 


m/n 


k/m 


RelErr 


RclRes 


CPU 


#AAt 


RelErr 


RclRes 


CPU 


#AAt 


0.3 


0.1 


7.29E-5 


4.41E-16 


0.44 


258.8 


1.55E-5 


9.19E-6 


0.39 


114.9 


0.3 


0.2 


7.70E-5 


4.65E-16 


0.78 


431.4 


2.50E-5 


6.77E-6 


1.11 


333.4 


0.2 


0.1 


4.26E-5 


4.54E-16 


0.66 


388.2 


3.39E-5 


1.51E-5 


0.45 


146.7 


0.2 


0.2 


7.04E-5 


4.85E-16 


1.15 


681.8 


1.40E-4 


1.03E-5 


2.50 


791.0 


0.1 


0.1 


4.17E-5 


4.86E-16 


1.11 


698.2 


1.25E-4 


3.26E-5 


0.64 


207.9 



to a higher accuracy should result in higher-quality solutions. Thus, we terminated DADM with a stringent 
stopping tolerance of e = 10~ 6 in (|4.ip . All parameters in SPGL1 arc set to be default values. Detailed 
comparison results are given in Table l4~3l where, besides relative error (RelErr) and the number of matrix- 
vector multiplications (#AAt), the relative residue RelRes = \\Ax — and CPU time (in seconds) are 
also given. 

As can be observed from previous experimental results, the proposed ADM algorithms may slow down 
after a fast convergence stage at the beginning. When measurements are free of noise and a highly accurate 
solution is demanded, the ADM algorithms can sometimes be slower than SPGL1. Table l4~3l shows that 
DADM is slower than SPGL1 in three of the five test cases in terms of CPU seconds while getting slightly 
lower accuracy (the results for m/n = 0.1 and k/m = 0.2 are omitted since both algorithms failed to recover 
an accurate solution). We note that since SPGL1 requires some non-trivial calculations other than matrix- 
vector multiplications, a larger #AAt number by DADM does not necessarily lead to a longer CPU time. Wc 
also comment that the relative residue results of DADM are always numerically zero because when AA* = I 
the sequence {x k } generated by DADM, applied to (|1.3p . satisfies Ax k+1 — b = (1 — j)(Ax k — b) and thus 
\\Ax — b\\ decreases geometrically. 

4.6. Summary. Wc provided supporting evidence to emphasize the important point that algorithm 
speed should be evaluated relative to solution accuracy. Since more often than not measurements are noisy 
applications, solving ^-problems to a very high accuracy is generally not warranted and unnecessary. It is 
more practically relevant to evaluate the speed of an algorithm based on how fast it achieves an appropriate 
accuracy that is consistent with noise levels in data. 

We presented extensive experimental results on ^-problems and compared the proposed first-order, 
primal-dual algorithms, derived from the ADM approach, with state-of-the-art algorithms FPC-BB, SpaRSA, 
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FISTA, CGD, SPGL1 and NESTA. Our numerical results show that on the tested cases the proposed ADM 
algorithms are efficient and stable. In particular, under practically relevant conditions where data contain 
noise and stopping tolerances are set to appropriate values (where a more stringent tolerance would not lead 
to a more accurate solution), the proposed ADM algorithms generally achieve lower relative errors within 
fewer or comparable number of iterations in comparison to other tested algorithms. 

The experimental results also indicate that the dual-based ADMs are generally more efficient than the 
primal-based ones. One plausible explanation is that when A is orthonormal, the dual-based algorithms are 
exact ADMs, while the primal ones are inexact which solve some subproblcm approximately. The dual-based 
ADMs have been implemented in a MATLAB package called YALL1 [57] (short for Your ALgorithm for LI), 
which is applicable to eight different £i-modcls including Q1.3p . (|1.4[) . (| 1 . 5|) . (|1.6[) and the corresponding 
nonncgative counterparts. 

5. Concluding remarks. We proposed to solve ^-problems arising from compressive sensing by first- 
order, primal-dual algorithms derived from the Alternating Direction Method (ADM) framework which 
is based on the classic augmented Lagrangian function and alternating minimization idea for structured 
optimization. This ADM approach is applicable to numerious ^-problems including (|1.3p , (|1.4p , (|1.5p , (|1.6p 
and their nonncgative counterparts, as well as others. When applied to the £i-problcms, the per-iteration cost 
of these algorithms is dominated by two matrix- vector multiplications. Extensive experimental results show 
that the proposed ADM algorithms, especially the dual-based ones, perform competitively with several state- 
of-the-art algorithms. On various classes of test problems with noisy data, the proposed ADM algorithms 
have unmistakably exhibited the following advantages over competing algorithms in the comparison: (i) they 
converge relatively fast without the help of a continuation or a line search technique; (ii) their performance is 
relatively insensitive to changes in model and algorithmic parameters; and (iii) they demonstrate a notable 
ability to quickly decrease relative error to true solutions. Although the ADM algorithms are not necessarily 
the fastest in reaching an extremely high accuracy when observed data are noiseless, they are arguably the 
most effective in obtaining the best achievable level of accuracy whenever data contain a nontrivial level of 
noise, which is the case most relevant to practical applications. 

The most influential feature of the ADM approach is perhaps its great versatility and its seemingly 
universal effectiveness for a wide range of optimization problems in signal, image and data analysis, particular 
those involving £i-likc rcgularizations such as nuclear-norm (sum of singular values) rcgularization in matrix 
rank minimization like the matrix completion problem |40[ [8[ [TO] . or the total variation (TV) regularization 
[42] widely used in image processing. While the nuclear-norm is just an extension of ^i-norm to the matrix 
case, the TV regularization can be converted to ^i-regularization after introducing a splitting variable [481152] , 
Therefore, the ADM approach is applicable to both nuclear-norm and TV regularized problems (in either 
primal or dual form) in a rather straightforward manner so that the derivations and discussions are largely 
analogous to those for ^-problems as presented in this paper. Recently, the ADM has also been applied to 
total variation based image reconstruction in [20[ I52[ 134] and to semi-definite programming in [49 . A more 
recent application of the ADM approach is to the problem of decomposing a given matrix into a sum of a 
low-rank matrix and a sparse matrix simultaneously using ^i-norm and nuclear-norm regularizations (see 
jTT]). An ADM scheme has been proposed and studied for this problem in [55] . 

Although the ADM approach is classic and its convergence properties have been well studied, its re- 
markable effectiveness in signal and image reconstruction problems involving £i-like regularizations has just 
been recognized very recently. These fruitful new applications bring new research issues, such as convergence 
of certain inexact ADM schemes, that should be interesting for further investigations. 
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Proof. Let (f, x) be any solution of (|2.5[) . From optimization theory, there exists y £ C m such that the 
following conditions are satisfied: 

r/fj,- y = 0, A*y £ d\\x\\i and Ax + f = b. (A.l) 

For convenience, we let f = r k+1 , 2; = x fc+1 and y = y k — [3(Ax + f — b). As such, y k+1 = y k — j(y k — §)■ 
From the definition of f, x and y, equation (|2.7[) can be reformulated as f//i — y + (3A(x k — x) = 0. Further 
considering f/fi — y = 0, we have y — y — /^^(a;^ — &) = (f — r)//x, and thus 

(f - f)* (y - y - (3A(x k - x)) = \\f - f\\ 2 /fi > 0. (A.2) 

Similarly, equation (|2.1ip is equivalent to A*y — (3A*A(x k — x) + ^(x k — x) £ 9||x||i. Further considering 
A*y £ d\\x\\i and the convexity of || • ||i, it can be shown that 

(x - $y (a*($ -y)- (3A*A(x k - x) + ^(x k ~ x)j > 0. (A.3) 

From Ax + r = b and [3 (Ax + f — b) = y k — y, the addition of (|A.2[) and (|A.3|) gives 

\(y - y)*(y k -y) + -(x- x)*(x k -x)> ( y k - y)*A(x k - x). (A.4) 

P T 

Let /„ be the identity matrix of order n. For convenience, we define 



x 



By using this notation and considering equality u — u = (u — u k ) + (u k — u), (|A.4|) implies 

(u k - u)*Gi(« fc - u) > \\u k - uf Gi + (y k - y)*A(x k - x). (A.6) 
The iteration of u in (|2.13p can be written as u k+1 = u k — Go(u k — u), from which it can be shown 
l\\% - \\u k+1 -u\\ 2 G = 2(u k - fi)*G x (u* - u) - \\G (u k - u)f G 



\u k - u\\ 



(from jM|) > 2\\u k - u\\ 2 Gi + 2(y k - y)*A(x k - x) - \\u k - u\\ 2 GoGGo 

(from jX5}) = ^\\x k - x\\ 2 + 2 — l\\y k - y\\ 2 + 2(y k ~ y)* A(x k - x) (A.7) 
r (3 

From condition rA max + 7 < 2. it holds that 6 = 1- rA max /(2 - 7) > 0. Let p = (2 - j)/(/3 + 135) > 0, 
inequality (|A.7[) implies 

\\u k -u\\%- \\u k+1 - uf G > ?-\\x k - i|| 2 + - v\\ 2 p\\y k -y\\ 2 - - p \\A(x k -x)\\ 2 

p A max ^ M __ fe .2,(2 — 7 \ .. k „ 2 



(from definitions of 8 and p) = — — ||a; fe — i|| 2 H t— -\\y k — ii\\ 2 

v t [3 1 + 5" 11 

(from definitions of x, y and G) = r]\\u k — u k+1 \\ G , (A. 8) 

where 77 = min (s 2 , ^Jy) > 0. It follows from fO)) that 

(a) \\u k — u fe+1 ||G converges to 0, and thus lim^oo Ax k + r k = b; 
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(b) {u k } lies in a compact region; 

(c) \\u k — u\\q is monotonically non-increasing and thus converges. 

From (b), {u k } has a subsequence {u kj } converges to u* = (x*; y*). From y kj — > y*, lim^oo Ax k + r k = b 
and the iterative formula for y, we have y k — > y*. From (c), \\u k — u\\q — * \\u* — u\\q, which, by further 
considering y k — > y*, implies that any limit point of {x k }, if more than one, must have an equal distance 
to x. For convenience, we let z(x,y) = x — tA*(Ax — b — y/(3)/(l + (j,(3). By eliminating r fc+1 , the second 
equation in (|2.13p becomes x k+1 = Shrink(z(a;' c , y k ), t/(3), which implies 

x k 3 +i = Shrink {z{x k i,y k i),T/0) -> Shrink (z(x* ,y*),r/P) = x**. 

Thus, x** is also a limit point of {x k } and must have an equal distance to x as x* does, i.e., 

\\x* -Z\\ = ||a;** = || Shrink (z(x*,y*), t/0)- Shrink (z(x,y),T//3) ||, (A.9) 

where the second equality is because x = Shrink (z(x,y), t//3). From the property of Shrink(-, r//3), equality 
(|A.9j) implies (see e.g., [29]) 

x*-x = Shrink {z(x*, y*),r//3) - Shrink {z(x,y),T//3) = Shrink (z (x*, y*), t/0) - x. 

Thus, x* = Shrink (z(x* , ?/*), t//3). By letting r* = b — Ax*, it is easy to show from the above discussions 
that (r*, x*,y*) is a solution to (|2.5j) . By letting u = (x,y) = (x*, y*) = w* and f = r* at the beginning, we 
get the convergence of {u k } and thus that of {r k ,x k , y k }. □ 



